suppressPackageStartupMessages({
library(SummarizedExperiment)
library(pheatmap)
library(RColorBrewer)
library(stringr)
library(plotly)
library(tidyverse)
})
Bulk RNA-seq Quantification
In this tutorial, we look at the results of bulk RNA-seq quantification. Our data comes from the following paper.
It is not important to understand the paper in detail. Just know we are dealing with two colon cancer cell lines, “HCT116” and “SW480”, which have been FACS-sorted according to the expression of CD44 and EpCAM into “EpCAMhigh” and “EpCAMlow” fractions, resulting in 4 groups with 4 replicates each.
HCT116: EpCAMhigh
HCT116: EpCAMlow
SW480: EpCAMhigh
SW480: EpCAMlow
We will focus on the tools available for QC rather than the underlying biology.
Setup
Load required packages
Downloading counts
First, let’s fetch the input count files generated from featureCounts for each sample.
ftpLocation <- "https://fgcz-gstore.uzh.ch/public/OmicsInOncology_CPS341nBio435_FS25_week3/feature_counts"
countDirectoryToUse <- "feature_counts"
dir.create(countDirectoryToUse, showWarnings = FALSE)
sampleNames <- c(
"HCT116_EpCAMhigh_1",
"HCT116_EpCAMhigh_2",
"HCT116_EpCAMhigh_3",
"HCT116_EpCAMhigh_4",
"HCT116_EpCAMlow_1",
"HCT116_EpCAMlow_2",
"HCT116_EpCAMlow_3",
"HCT116_EpCAMlow_4",
"SW480_EpCAMhigh_1",
"SW480_EpCAMhigh_2",
"SW480_EpCAMhigh_3",
"SW480_EpCAMhigh_4",
"SW480_EpCAMlow_1",
"SW480_EpCAMlow_2",
"SW480_EpCAMlow_3",
"SW480_EpCAMlow_4"
)
countFilenames <- paste0(sampleNames, ".txt")
for (fn in countFilenames) {
download.file(url=file.path(ftpLocation, fn),
destfile=file.path(countDirectoryToUse, fn))
}
You will notice a new directory called “feature_counts” in the same folder where you executed this code. Take a quick look inside.
How were these counts generated?
Trimming with fastp
Alignment with STAR to the GRCh38.p13 reference genome
Running featureCounts from the Rsubread package using the GENCODE Release 42 GTF file. The file used can be found here on our public FTP.
The file is quite large; you need not download it or be familiar with its format at this point. What is important to know is that this file has been subset to only the protein_coding genes.
There are a few reasons why we might prefer to exclude transcript types such as lncRNA and snRNA.
Including these other transcript types increases the number of tests, meaning more false positives and less power to detect differences in protein-coding genes
Many non-coding RNAs lack poly-A tails and are depleted by poly-A-selected mRNA sequencing, so their apparent absence might be misinterpreted
They are sometimes not well annotated and this can adversely affect enrichment analysis
In general, a good strategy is to exclude them by default and include them by demand (i.e. if the research question explicitly demands it).
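To make the protein_coding subset concrete, here is a minimal base-R sketch of how such a filter could be applied. The three records below are a made-up in-memory excerpt with hypothetical gene IDs; in practice you would read the real GTF with e.g. data.table::fread or rtracklayer::import rather than hard-coding lines:

```r
# Hypothetical GTF records (tab-separated; attributes live in column 9)
gtfLines <- c(
'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG_A"; gene_type "lncRNA";',
'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\tgene_id "ENSG_B"; gene_type "protein_coding";',
'chr1\tHAVANA\tgene\t89295\t120932\t.\t-\t.\tgene_id "ENSG_C"; gene_type "snRNA";'
)
# Keep only records annotated as protein_coding (GENCODE uses the gene_type key)
pcLines <- gtfLines[grepl('gene_type "protein_coding"', gtfLines, fixed = TRUE)]
length(pcLines) # 1
```

The same pattern extends to keeping several biotypes at once by matching against a vector of allowed gene_type values.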
Loading counts
Now that we have the counts, we can start exploring the data. We could proceed with either the counts generated by Kallisto or featureCounts. Let’s load the featureCounts output for now.
# Define meta dataframe for later use
meta <- data.frame(
Condition=as.factor(c(rep("HCT116_EpCAMhigh", 4),
rep("HCT116_EpCAMlow", 4),
rep("SW480_EpCAMhigh", 4),
rep("SW480_EpCAMlow", 4))),
row.names=sampleNames
)
# Define some general-use parameters for use later
sigThresh <- 10 # signal threshold for the counts
conditionColours <- scales::hue_pal()(length(unique(meta$Condition))) # choose pretty colours
# associate the conditions with the pretty colours
names(conditionColours) <- unique(meta$Condition)
sampleColours <- conditionColours[meta$Condition]
First, we load the counts and combine them into a single data-frame. Note we do this here manually, but there are methods like the tximport function which perform this operation in a convenient way, and beyond this, correct for biases in the counts. We will be using this method of import in the optional section on differential expression analysis.
# Find the count files on the system
countPaths <- Sys.glob(file.path(countDirectoryToUse, sprintf("*%s.txt", rownames(meta))))
names(countPaths) <- rownames(meta)
# Load into memory and combine
countList <- lapply(rownames(meta), function(sn) {
count <- data.table::fread(countPaths[[sn]]) %>%
as.data.frame() %>%
column_to_rownames(var="Identifier") %>%
dplyr::rename_with(~sn)
return(count)
})
rawCounts <- bind_cols(countList)
# Save merged data for later use
saveRDS(rawCounts, file=file.path(countDirectoryToUse, "mergedCounts.rds"))
Count Statistics
Let’s look at the count statistics first. In the first of two plots, we get the total counts for each sample. In the second, we see how many features, and what fraction of all features, exceed our signal threshold.
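Both summaries reduce to colSums calls on the count matrix: one on the counts themselves, one on a logical matrix marking which entries exceed the threshold. A minimal sketch on a made-up 5 x 3 count matrix (toy numbers, not our data):

```r
# Toy count matrix: 5 genes x 3 samples (made-up values)
m <- matrix(c(0, 5, 12, 100, 3,
0, 0, 20, 80, 15,
1, 2, 8, 60, 11), nrow = 5)
colnames(m) <- paste0("S", 1:3)
thresh <- 10
colSums(m)                           # total counts: 120 115 82
colSums(m > thresh)                  # features above threshold: 2 3 2
100 * colSums(m > thresh) / nrow(m)  # as percentages: 40 60 40
```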
toPlot <- tibble(
Sample=colnames(rawCounts),
`Read Counts`=colSums(rawCounts),
`Features above threshold`=colSums(rawCounts > sigThresh),
Percentages=paste(signif(100 * colSums(rawCounts > sigThresh) / nrow(rawCounts), digits=2), "%")
)
plot_ly(toPlot, x = ~Sample, y = ~`Read Counts`, type = "bar") %>%
layout(title="Total reads", yaxis = list(title = "Counts"))
plot_ly(toPlot, x = ~Sample, y = ~`Features above threshold`, type="bar",
text=~Percentages, textposition = 'auto') %>%
layout(title="Genomic features with reads above threshold", yaxis = list(title = "Count"))
Exercise #2
- How do the read counts compare to the initial FastQC report? See the plot below.
- How many features (in %) are above threshold? Do you think re-sequencing the samples with relatively low counts will be useful or provide a significant return on investment?
The counts after mapping and quantification are lower, since not every sequenced read maps uniquely to an annotated protein-coding gene.
All samples are around 50-60%. Re-sequencing is unlikely to be worth our time, unless we are interested in very lowly-expressed genes. Even the sample with the lowest total counts has a comparable number of features above the threshold, despite having only a fraction of the reads.
Filtering Counts
Before continuing, we should filter out genes which are lowly expressed or absent, which we do on a per-group basis (think about why this might be). Next, we generate normalised counts using “Variance Stabilisation” as provided in the DESeq2 package. As per the documentation, this function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which now have constant variance along the range of mean values (homoskedastic) (Love, Huber, and Anders 2014). There are other functions we could use, such as rlog which is less sensitive to size factors, but VST works well in the general case.
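To see why a variance-stabilising transform is needed at all, here is a small base-R simulation (this is not DESeq2’s actual VST, just an illustration assuming negative-binomial counts with dispersion 0.1): on the raw scale the per-gene standard deviation grows strongly with the mean, while after a simple log transform it is roughly comparable across expression levels:

```r
set.seed(1)
# Simulate 100 low- and 100 high-expressed genes across 8 samples
low  <- matrix(rnbinom(8 * 100, mu = 10,   size = 10), nrow = 100)
high <- matrix(rnbinom(8 * 100, mu = 1000, size = 10), nrow = 100)
# Raw scale: SD depends strongly on the mean
mean(apply(low,  1, sd))   # small, roughly the scale of sqrt(mu + mu^2/size)
mean(apply(high, 1, sd))   # orders of magnitude larger
# Log scale: SDs become comparable across the expression range
mean(apply(log2(low + 1),  1, sd))
mean(apply(log2(high + 1), 1, sd))
```

DESeq2’s vst goes further by estimating the dispersion-mean trend from the data itself, but the basic effect is the same flattening of the variance-mean relationship.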
# Keep only genes expressed above threshold in at least half the samples of at least half the groups
isPresent <- rawCounts > sigThresh
isPresentCond <- rowsum(t(isPresent * 1), group=meta$Condition)
isPresentCond <- t(sweep(isPresentCond, 1,
table(meta$Condition)[rownames(isPresentCond)], FUN="/")) >= 0.5
isValid <- rowMeans(isPresentCond) >= 0.5
rawCountsFilt <- rawCounts[isValid, ]
# Load data into a DESeq2 dataset so we can use the variance stabilizing transformation from DESeq2
dds <- DESeq2::DESeqDataSetFromMatrix(countData=rawCountsFilt,
colData=meta,
design=~Condition)
vsd <- DESeq2::vst(dds)
# Extract normalized counts
vsdSE <- SummarizedExperiment::assay(vsd)
Let’s quickly compare the raw and normalised counts to see how they have changed.
head(rawCountsFilt)
HCT116_EpCAMhigh_1 HCT116_EpCAMhigh_2 HCT116_EpCAMhigh_3
ENSG00000187634 290 505 253
ENSG00000188976 8568 16492 9221
ENSG00000187961 1025 1914 871
ENSG00000187583 148 326 181
ENSG00000187642 22 58 27
ENSG00000188290 103 196 116
HCT116_EpCAMhigh_4 HCT116_EpCAMlow_1 HCT116_EpCAMlow_2
ENSG00000187634 252 125 371
ENSG00000188976 7034 3262 9800
ENSG00000187961 1052 360 1536
ENSG00000187583 148 54 234
ENSG00000187642 20 7 51
ENSG00000188290 122 45 194
HCT116_EpCAMlow_3 HCT116_EpCAMlow_4 SW480_EpCAMhigh_1
ENSG00000187634 185 101 5
ENSG00000188976 6339 2471 228
ENSG00000187961 870 304 12
ENSG00000187583 145 44 8
ENSG00000187642 23 7 1
ENSG00000188290 94 29 3
SW480_EpCAMhigh_2 SW480_EpCAMhigh_3 SW480_EpCAMhigh_4
ENSG00000187634 163 92 62
ENSG00000188976 6415 3448 3318
ENSG00000187961 300 205 170
ENSG00000187583 227 153 115
ENSG00000187642 22 18 15
ENSG00000188290 62 20 30
SW480_EpCAMlow_1 SW480_EpCAMlow_2 SW480_EpCAMlow_3
ENSG00000187634 74 26 73
ENSG00000188976 3243 1124 2175
ENSG00000187961 204 63 99
ENSG00000187583 65 29 22
ENSG00000187642 5 2 2
ENSG00000188290 21 4 7
SW480_EpCAMlow_4
ENSG00000187634 73
ENSG00000188976 2698
ENSG00000187961 149
ENSG00000187583 28
ENSG00000187642 3
ENSG00000188290 11
head(as.data.frame(vsdSE))
HCT116_EpCAMhigh_1 HCT116_EpCAMhigh_2 HCT116_EpCAMhigh_3
ENSG00000187634 8.672862 8.578033 8.651217
ENSG00000188976 12.768795 12.752728 13.028374
ENSG00000187961 9.972398 9.923392 9.912150
ENSG00000187583 8.153481 8.239491 8.379435
ENSG00000187642 7.266120 7.361191 7.365607
ENSG00000188290 7.923728 7.910310 8.065749
HCT116_EpCAMhigh_4 HCT116_EpCAMlow_1 HCT116_EpCAMlow_2
ENSG00000187634 8.590065 8.460559 8.411885
ENSG00000188976 12.553433 12.239514 12.170682
ENSG00000187961 10.054280 9.452903 9.787046
ENSG00000187583 8.183144 7.888802 8.082709
ENSG00000187642 7.251120 7.123178 7.352301
ENSG00000188290 8.054338 7.788190 7.964821
HCT116_EpCAMlow_3 HCT116_EpCAMlow_4 SW480_EpCAMhigh_1
ENSG00000187634 8.357132 8.357004 7.625560
ENSG00000188976 12.432240 11.966773 11.165563
ENSG00000187961 9.851438 9.357700 8.114016
ENSG00000187583 8.180735 7.818910 7.865215
ENSG00000187642 7.297409 7.141239 7.106193
ENSG00000188290 7.905483 7.611647 7.416489
SW480_EpCAMhigh_2 SW480_EpCAMhigh_3 SW480_EpCAMhigh_4
ENSG00000187634 7.886821 8.014233 7.862560
ENSG00000188976 11.648110 11.871018 12.008915
ENSG00000187961 8.283221 8.600562 8.560418
ENSG00000187583 8.090082 8.366243 8.257848
ENSG00000187642 7.131458 7.284304 7.271467
ENSG00000188290 7.435678 7.316746 7.512758
SW480_EpCAMlow_1 SW480_EpCAMlow_2 SW480_EpCAMlow_3
ENSG00000187634 7.692577 7.634180 7.861726
ENSG00000188976 11.300423 11.119901 11.216085
ENSG00000187961 8.310772 8.132933 8.044319
ENSG00000187583 7.630993 7.685959 7.339095
ENSG00000187642 6.944584 6.945512 6.876795
ENSG00000188290 7.225017 7.057033 7.051816
SW480_EpCAMlow_4
ENSG00000187634 7.690245
ENSG00000188976 11.068999
ENSG00000187961 8.097289
ENSG00000187583 7.311618
ENSG00000187642 6.884848
ENSG00000188290 7.075951
Dimensionality Reduction
We now have a count matrix with the following dimensions: 16 x 14’392. In other words, we have represented each sample by a vector of 14’392 features. In order to adequately QC our samples, one important aspect is determining how similar or dissimilar one vector of genes from a specific sample is to that of another. However, interpreting 14’392 dimensions directly is not something that humans are generally capable of doing. This is where dimensionality reduction comes in. We would ideally like to reduce the given dimensions (a 14’392-dimensional space) down to a 2- or 3-dimensional space, so we can visualise the result.
Two approaches, which are in fact related, are PCA (principal component analysis) and MDS (multidimensional scaling).
PCA
PCA is a linear dimensionality reduction technique which, in essence, finds a rotation of the data that maximises the variance along the new axes. This is a common method for dimensionality reduction across many different disciplines within Computer Science and beyond. We will use the built-in R function prcomp below to calculate the principal components and manually calculate the variance explained by each component.
Additionally, we will plot a ‘scree’ plot, which aims to visualise the variance explained by each principal component.
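As a quick sanity check of what prcomp does (a toy illustration on random data, not our counts): the scores it returns are exactly the centered data multiplied by the rotation matrix, and the explained variances are non-increasing and sum to 100%:

```r
set.seed(42)
X <- matrix(rnorm(16 * 50), nrow = 16)  # toy: 16 "samples" x 50 "genes"
pc <- prcomp(X, scale. = FALSE)
# PCA is a rotation: scores = centered data %*% rotation
Xc <- sweep(X, 2, colMeans(X))
stopifnot(all.equal(unname(pc$x), unname(Xc %*% pc$rotation)))
# Explained variance per component is non-increasing and sums to 100%
varExp <- 100 * pc$sdev^2 / sum(pc$sdev^2)
stopifnot(all(diff(varExp) <= 1e-8), abs(sum(varExp) - 100) < 1e-8)
```

The varExp computation here is the same one we apply to the real data below.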
# Run PCA
pcDat <- prcomp(t(vsdSE), scale. = FALSE)
# Calculate explained variance
varExp <- (100*pcDat$sdev^2)/sum(pcDat$sdev^2)
# Store the explained variance of top 8 PCs
varExp_df <- data.frame(PC= paste0("PC",1:8),
varExp=varExp[1:8])
# Scree plot
varExp_df %>%
ggplot(aes(x=PC,y=varExp, group=1)) +
geom_point(colour="steelblue", size=4) +
geom_col(fill="steelblue") +
geom_line() +
theme_bw() + ylim(c(0,100))
Let’s now plot the first and second principal components in 2-dimensions.
plot_ly(as.data.frame(pcDat$x), x=~PC1, y=~PC2, color=meta$Condition, colors="Set1",
type="scatter", mode="markers") %>%
layout(title="PCA Plot")
Multi-dimensional scaling (MDS)
Multi-dimensional scaling is another dimensionality reduction technique, which aims to place points in a low-dimensional space such that their pairwise distances reproduce a given distance matrix as closely as possible. Since in the case of RNA-seq we are given the data vectors directly rather than a distance matrix, methods must calculate the distance matrix first, upon which the MDS algorithm is then performed. PCA is used in the process to produce a reduced-dimensionality projection from the similarities.
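Classical MDS is also available in base R as cmdscale; a toy sketch showing that, given only a Euclidean distance matrix, it reconstructs coordinates whose pairwise distances match the originals (random points, with the embedding dimension chosen large enough for exact recovery):

```r
set.seed(7)
pts <- matrix(rnorm(6 * 5), nrow = 6)  # 6 toy points in 5 dimensions
d <- dist(pts)                         # pairwise Euclidean distances
emb <- cmdscale(d, k = 5)              # classical MDS from distances alone
# The embedding preserves the pairwise distances
all.equal(as.vector(dist(emb)), as.vector(d), tolerance = 1e-6)  # TRUE
```

In QC practice the embedding dimension is set to 2 or 3 for visualisation, trading some distance fidelity for interpretability.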
Here, we use limma (Ritchie et al. 2015) to calculate the MDS and use plotly to visualise the pairwise distances in 3-dimensions.
mds <- limma::plotMDS(vsdSE, plot=FALSE)
mdsOut <- mds$eigen.vectors[,1:3]
colnames(mdsOut) <- c("Leading logFC dim1", "Leading logFC dim2",
"Leading logFC dim3")
toPlot <- cbind(meta %>% rownames_to_column("Sample"), mdsOut)
plot_ly(toPlot, x=~`Leading logFC dim1`, y=~`Leading logFC dim2`, z=~`Leading logFC dim3`, color=~Condition, colors="Set1", type='scatter3d', mode='markers+text', text=~Sample, textposition = "top right") %>%
plotly::layout(title="Classical MDS", scene=list(xaxis=list(title = 'Leading logFC dim1'), yaxis = list(title = 'Leading logFC dim2'), zaxis = list(title = 'Leading logFC dim3')))
Exercise #3
- How many genes are used by the function plotMDS as inputs? What kind of genes are these?
- Change the above function call so that all genes in the count matrix are used as input.
By default, the top 500 most variable genes (those with the largest log-fold-changes between samples) are used.
We must change the top argument in order to include all genes in the count matrix.
mds <- limma::plotMDS(vsdSE, top=nrow(vsdSE), plot=FALSE)
Correlation Plots
We can also quantify the similarity between samples using correlations. This involves simply computing pairwise correlations between samples, using the gene count vector of each sample as input. The following methods use this approach as a basis.
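The conversion from correlation to distance used below is simply 1 - cor: identical expression profiles get a distance near 0, perfectly anti-correlated ones a distance near 2. A toy sketch with made-up vectors:

```r
x <- 1:10
y <- 10:1       # perfectly anti-correlated with x
z <- 2 * x + 3  # perfectly correlated with x
dd <- 1 - cor(cbind(x, y, z))  # correlation -> dissimilarity
dd["x", "z"]  # ~0: identical profiles
dd["x", "y"]  # ~2: opposite profiles
```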
By-Sample Dendrogram and Correlations
We can apply hierarchical clustering using the Pearson correlations as inputs. This will group samples together in a hierarchical fashion in a tree-like structure.
d <- as.dist(1-cor(vsdSE, use="complete.obs")) # convert pairwise correlations into distances
hc <- hclust(d, method="ward.D2") # run the clustering algorithm
WGCNA::plotDendroAndColors(hc, sampleColours, autoColorHeight=TRUE, hang = -0.1)
Correlation Heatmap
Similarly, we can call pheatmap on the correlations to easily visualize the similarity between samples within and across conditions.
# Pearson correlation plot
pheatmap(
mat = cor(vsdSE, use="complete.obs"),
treeheight_row = 100,
treeheight_col = 100,
cutree_rows = 2,
cutree_cols = 2,
silent = F,
annotation_col = meta,
annotation_colors = list(Condition = conditionColours),
color = brewer.pal(n = 9, name = "Blues"),
fontsize_row = 12,
fontsize_col = 12,
display_numbers = TRUE,
fontsize_number = 12)
Top 2’000 Variable Genes Heatmap
# First, we center the matrix
vsdSECentered <- sweep(vsdSE, 1, rowMeans(vsdSE))
# Identify high variance features
topGenes <- rownames(vsdSE)[head(order(matrixStats::rowSds(vsdSE, na.rm=TRUE),
decreasing = TRUE), 2000)]
countsToPlot <- vsdSECentered[topGenes,]
# Clustering of high variance features
hmObj <- pheatmap(countsToPlot,
clustering_method="ward.D2",
scale = "row", cluster_rows = TRUE,
cluster_cols = TRUE, show_rownames = FALSE,
cutree_rows = 6, cutree_cols = 2,
treeheight_row = 50, treeheight_col = 50,
annotation_col = meta,
fontsize_row = 8, fontsize_col = 9,
annotation_legend = TRUE,
fontsize=8)
hmObj